Add matmul with transpose #35

Merged
austinvhuang merged 1 commit into AnswerDotAI:main from junjihashimoto:feature/transposed-matmul
Jul 31, 2024
Conversation

@junjihashimoto
Collaborator

This PR implements matrix multiplication with a transposed weight matrix.
On an NVIDIA GPU it is about 1.5 times faster.

# In case of NVIDIA GeForce RTX 3080 Laptop GPU
$ yes |  MATMUL_VERSION=8  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 8: 2D blocktiling with loop unrolling and vectorization, 30 iterations ...
113.3 milliseconds / dispatch ~ 2426.99 GFLOPS
$ yes |  MATMUL_VERSION=9  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 9: 2D blocktiling with loop unrolling, vectorization and transpose, 30 iterations ...
74.7 milliseconds / dispatch ~ 3680.25 GFLOPS

# In case of M2 pro
$ yes |  MATMUL_VERSION=8  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 8: 2D blocktiling with loop unrolling and vectorization, 30 iterations ...
164.3 milliseconds / dispatch ~ 1672.82 GFLOPS
$ yes |  MATMUL_VERSION=9  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 9: 2D blocktiling with loop unrolling, vectorization and transpose, 30 iterations ...
160.7 milliseconds / dispatch ~ 1710.18 GFLOPS
